Note: An html version of this document, which includes the evaluated result of all code chunks, can be found at https://nzilbb.github.io/How-to-Train-Your-Classifier/How_to_Train_Your_Classifier.html.
startTime <- proc.time()
startSysTime <- Sys.time()
setupEndTime <- proc.time()
crc-version branch

This document was created to test the performance of the Center for Research Computing high-performance computing cluster, and to save the data for later. Because Pitt’s HPC cluster setup differs from the University of Canterbury HPC cluster on which this document was originally run, the output data in this document isn’t identical to that of the original. Information on cluster performance (i.e., timing) can be found below. This repo also contains the Slurm shell script that was used to run this document.
This guide walks readers through the process of training a random-forest classifier for automated coding of sociophonetic variation, using the statistical computing language R (R Core Team, 2018) and the packages ranger (Wright & Ziegler, 2017) and caret (Kuhn, 2018). This guide is intended to be a companion to our manuscript “From categories to gradience: Auto-coding sociophonetic variation with random forests”, and it recreates the classifier of Southland English non-prevocalic /r/ discussed in the manuscript. By consulting the R Markdown file that generated this document, readers can run the code on their own systems to recreate the /r/ classifier.
IMPORTANT: While you can run all this code, we don’t recommend trying to run it all at once (e.g., by ‘knitting’ the R Markdown document); the code that generated this document was tailored to a machine with 32 processors and still takes several hours to run start-to-finish. As we mention below, training a classifier is not a one-size-fits-all process.
This guide goes through the following steps:
Pre-process data
Determine classifier performance measure
Train an initial classifier
Tune hyperparameters
Determine effect of outliers
Train final classifier
Auto-code new tokens and extract classifier probabilities for hand-coded tokens
Perhaps the most important thing to know is that training a classifier is not a one-size-fits-all process. Your decisions about how to go about training a classifier will change depending on your data, your variables, your research questions, your standards for what counts as adequate performance, etc. This guide describes the process for one particular set of data, a particular variable, particular research questions, etc., and shouldn’t be taken as an exact template for every possible case.
This guide is intended to be accessible to intermediate to advanced R users. Some of the coding techniques used in this guide are fairly advanced, but you won’t need to understand all the code to use this guide or to train your own classifiers.
This guide also includes text that’s visible only in the R Markdown version of the document (to avoid cluttering the html version), giving more detail about the implementation of the code. Like this paragraph, R Markdown-only text in this document begins with a ‘greater than’ sign.
The steps in this guide can be run on any computer that can run R. That said, progressing through all of the steps listed here can be very time-consuming, especially with large amounts of data. As a result, all of the forests in this guide were generated on a computing cluster running Ubuntu 16.04 with 32 virtual cores and 64 GB of memory, in order to leverage R’s support for parallel computing. (Even on this powerful hardware, the code in this guide takes over 4 hours of pure running time.) The code used here for running models in parallel (which uses the package doParallel) should work on all systems, although you will need to adjust the number of cores to your system (you can find out how many cores are available via the detectCores() function in the parallel package). Note that train() (the function we use to train classifiers in this guide) by default uses a parallel backend when one is available (e.g., to train multiple resamples at a time), so parallelization is used even in instances where the end-user doesn’t explicitly invoke it using doParallel methods.
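As a minimal illustration of R’s parallel support, the following base-R sketch (using only the built-in parallel package; the guide itself registers clusters via doParallel) checks the available cores and runs a toy task on a two-worker cluster:

```r
library(parallel)

## How many cores does this machine have?
nCores <- detectCores()

## Spin up a small cluster, run a toy task on the workers, and shut it down
cl <- makeCluster(2)
res <- parSapply(cl, 1:4, function(i) i^2)
stopCluster(cl)

res
## [1]  1  4  9 16
```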
Since random forests are a stochastic process, results can be slightly different with each run, so this guide uses random seeds to ensure reproducibility. In single forest runs, it suffices to use set.seed() prior to running train(). In situations where doParallel methods are used to run a list of forests differing by a hyperparameter, it’s necessary to use clusterSetRNGStream() after registering the cluster and before calling foreach(); this ensures that each core starts with the same random seed. Finally, in Step 5 when we want to run lists of non-identical forests that don’t differ by any parameters, we use the seeds argument of trainControl().
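As a minimal illustration of why seeding matters, re-setting the same seed reproduces the same ‘random’ draws:

```r
## set.seed() makes a stochastic result reproducible: the same seed always
## yields the same stream of random numbers
set.seed(302302)
run1 <- rnorm(3)

set.seed(302302)
run2 <- rnorm(3)

identical(run1, run2)
## [1] TRUE
```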
Finally, our code creates ‘dummy files’, with no extension or contents, in a Model Status subfolder to help us keep track of when the sub-models finish running. This technique is especially useful when you expect some sub-models to take longer than others (e.g., repeated k-fold cross-validation will naturally take longer than non-repeated k-fold cross-validation). Although we’re only saving ‘dummy files’ here, you can also save more useful files; for example, if the overall model is too big to store in memory and the model crashes before all sub-models are complete, you can replace the second ‘dummy file’ with a command to save each sub-model when it’s finished running or a summary of the sub-model.
if (!dir.exists("Model Status")) dir.create("Model Status")
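A hypothetical sketch of the ‘dummy file’ technique (the sub-model name here is illustrative): touch an empty, extensionless file when a sub-model starts and another when it finishes, so you can watch the Model Status folder to monitor progress:

```r
if (!dir.exists("Model Status")) dir.create("Model Status")

modelName <- "exampleModel"  # hypothetical sub-model name

## Touch an empty file to mark the start of the sub-model
file.create(file.path("Model Status", paste0(modelName, "_started")))

## ... run the sub-model here ...

## Touch another empty file to mark completion
file.create(file.path("Model Status", paste0(modelName, "_finished")))

file.exists("Model Status/exampleModel_finished")
## [1] TRUE
```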
To run the code in this guide, you’ll need R and several R packages. In particular, this code relies heavily on the collection of packages called the tidyverse, which make R code faster, more readable, and safer. Though you can run this code without it, we recommend the RStudio integrated development environment for ease of understanding code. Information on the system, R version, and package versions that generated this guide can be found at the end of this guide.
As a prerequisite to the steps we describe in this guide, you’ll also need data that has been hand-coded into two or more variants and for which you have all the acoustic measures you’ll want to enter into the classifier. In the case of our /r/ classifier (for which we provide the data as an example), we had several thousand hand-coded tokens, for which we extracted 180 acoustic measures using the “Process with Praat” function in LaBB-CAT (Fromont & Hay, 2012); these measures were motivated by previous literature on the acoustic correlates of rhoticity in English and other languages (e.g., Lawson, Stuart-Smith, & Scobbie, 2018; Heselwood, 2009). More details on the choice of these measures can be found in the main text.
Load the /r/ data, which includes 40614 rows and 217 columns, the last 180 of which are acoustic measures to be fed into the classifier.
The first 37 columns contain mostly linguistic information, plus LaBB-CAT metadata (TokenNum and MatchId) and speaker data (Speaker, which has been anonymized, and Gender). To preserve anonymity, both Speaker and Word have been anonymized to dummy factors. Stress is coded as either ’ (primary word stress), " (secondary word stress), or 0 (unstressed). Syllable and columns ending in “DISC” are phonemic strings represented in the DISC phonetic alphabet used by CELEX (see https://catalog.ldc.upenn.edu/docs/LDC96L14/eug_let.pdf); FollSegmentNoPause and FollSegment are represented in the ARPAbet codes used by the CMU pronouncing dictionary, without stress markers on vowels (see http://www.speech.cs.cmu.edu/cgi-bin/cmudict); and Vowel is represented in Wells lexical sets as they relate to New Zealand English (see https://www.phon.ucl.ac.uk/home/wells/stanlexsets.htm). HowCoded is used to separate tokens that have already been hand-coded from those to be auto-coded (which therefore have NA for Rpresent). All other columns should be self-explanatory.
trainingData <- readRDS("Data/RClassifierData_03July2019.Rds")
trainingData %>% dim()
[1] 40614 217
trainingData %>% colnames()
[1] "TokenNum" "Speaker" "Gender"
[4] "MatchId" "Stress" "FollPause"
[7] "CelexFreq" "FollPauseDur" "Syllable"
[10] "SyllStart" "SyllEnd" "Word"
[13] "WordStart" "WordEnd" "TokenDISC"
[16] "TokenStart" "TokenEnd" "TokenGenAmDISC"
[19] "TokenGenAmStart" "TokenGenAmEnd" "Rpresent"
[22] "FollSegmentDISC" "RType" "PrevRInWord"
[25] "ContFuncWord" "CorpusFreq" "Vowel"
[28] "NURSEvsNonNURSE" "FollSegmentNoPause" "FollSegment"
[31] "SyllFinal" "WordFinal" "HowCoded"
[34] "Initial_F1_50" "Initial_F2_50" "Initial_F3_50"
[37] "Initial_F4_50" "TokenDur" "F1_20"
[40] "diffF3F1_20" "BW1_20" "F1_25"
[43] "diffF3F1_25" "BW1_25" "F1_30"
[46] "diffF3F1_30" "BW1_30" "F1_35"
[49] "diffF3F1_35" "BW1_35" "F1_40"
[52] "diffF3F1_40" "BW1_40" "F1_45"
[55] "diffF3F1_45" "BW1_45" "F1_50"
[58] "diffF3F1_50" "BW1_50" "F1_55"
[61] "diffF3F1_55" "BW1_55" "F1_60"
[64] "diffF3F1_60" "BW1_60" "F1_65"
[67] "diffF3F1_65" "BW1_65" "F1_70"
[70] "diffF3F1_70" "BW1_70" "F1_75"
[73] "diffF3F1_75" "BW1_75" "F1_80"
[76] "diffF3F1_80" "BW1_80" "F1min"
[79] "F1max" "time_F1min" "time_F1max"
[82] "F1range" "time_F1range" "slopeF1"
[85] "F2_20" "diffF3F2_20" "BW2_20"
[88] "F2_25" "diffF3F2_25" "BW2_25"
[91] "F2_30" "diffF3F2_30" "BW2_30"
[94] "F2_35" "diffF3F2_35" "BW2_35"
[97] "F2_40" "diffF3F2_40" "BW2_40"
[100] "F2_45" "diffF3F2_45" "BW2_45"
[103] "F2_50" "diffF3F2_50" "BW2_50"
[106] "F2_55" "diffF3F2_55" "BW2_55"
[109] "F2_60" "diffF3F2_60" "BW2_60"
[112] "F2_65" "diffF3F2_65" "BW2_65"
[115] "F2_70" "diffF3F2_70" "BW2_70"
[118] "F2_75" "diffF3F2_75" "BW2_75"
[121] "F2_80" "diffF3F2_80" "BW2_80"
[124] "F2min" "F2max" "time_F2min"
[127] "time_F2max" "F2range" "time_F2range"
[130] "slopeF2" "F3_20" "BW3_20"
[133] "F3_25" "BW3_25" "F3_30"
[136] "BW3_30" "F3_35" "BW3_35"
[139] "F3_40" "BW3_40" "F3_45"
[142] "BW3_45" "F3_50" "BW3_50"
[145] "F3_55" "BW3_55" "F3_60"
[148] "BW3_60" "F3_65" "BW3_65"
[151] "F3_70" "BW3_70" "F3_75"
[154] "BW3_75" "F3_80" "BW3_80"
[157] "F3min" "F3max" "time_F3min"
[160] "time_F3max" "F3range" "time_F3range"
[163] "slopeF3" "intens_F3min" "intens_F3max"
[166] "F4_20" "diffF4F3_20" "BW4_20"
[169] "F4_25" "diffF4F3_25" "BW4_25"
[172] "F4_30" "diffF4F3_30" "BW4_30"
[175] "F4_35" "diffF4F3_35" "BW4_35"
[178] "F4_40" "diffF4F3_40" "BW4_40"
[181] "F4_45" "diffF4F3_45" "BW4_45"
[184] "F4_50" "diffF4F3_50" "BW4_50"
[187] "F4_55" "diffF4F3_55" "BW4_55"
[190] "F4_60" "diffF4F3_60" "BW4_60"
[193] "F4_65" "diffF4F3_65" "BW4_65"
[196] "F4_70" "diffF4F3_70" "BW4_70"
[199] "F4_75" "diffF4F3_75" "BW4_75"
[202] "F4_80" "diffF4F3_80" "BW4_80"
[205] "F4min" "F4max" "time_F4min"
[208] "time_F4max" "F4range" "time_F4range"
[211] "slopeF4" "F0min" "F0max"
[214] "F0rangeST" "time_F0min" "time_F0max"
[217] "absSlopeF0"
Due to a fair amount of measurement error in automatically extracting formant and pitch measurements, and to account for the degree to which formant frequencies are impacted by vocal tract length, we subjected /r/ measurements to pre-processing before entering data into the classifier. Your data might not need all these steps—we didn’t need to do nearly as much pre-processing for the medial /t/ data—but we’ll walk through them for your reference.
First, we removed tokens that either had bad durations or immediately preceded another /r/. Checking durations is a way to filter out tokens whose forced alignment is bad, since individual /r/ tokens are unlikely to be long (except perhaps in word lists or utterance-finally). We randomly sampled tokens in a variety of duration bands (e.g., [500 ms, 550 ms)) to check for misalignment; this procedure led to a ceiling of 450 ms. A handful of tokens somehow had zero or negative durations, so we removed these tokens, as well.
trainingData <-
trainingData %>%
filter(TokenDur > 0, TokenDur < 0.45, FollSegment!="R")
Second, we normalized F1–F4 measurements (timepoint measurements and maxima/minima) by subtracting the speaker’s mean word-initial /r/ midpoint measurement for that formant from the raw measurement. As with vowels, formant measurements for rhotics are affected by vocal tract length, but whereas normalization methods abound for vowel formant frequencies, there is no generally accepted way to normalize formants in rhotics. Our method of using each speaker’s initial /r/ formants as a baseline measure accounts for vocal tract length, following Hay and Clendon’s (2012) approach of modeling rhoticity by comparing non-prevocalic /r/ F3 against initial /r/ F3.
trainingData <-
trainingData %>%
mutate_at(vars(matches("^F1(_[2-8][50]|min|max)$")), ~ . - Initial_F1_50) %>%
mutate_at(vars(matches("^F2(_[2-8][50]|min|max)$")), ~ . - Initial_F2_50) %>%
mutate_at(vars(matches("^F3(_[2-8][50]|min|max)$")), ~ . - Initial_F3_50) %>%
mutate_at(vars(matches("^F4(_[2-8][50]|min|max)$")), ~ . - Initial_F4_50)
Third, we centered and scaled token duration by speaker and vowel; in cases where token duration was invariant within a speaker \(\times\) vowel distribution, we used a value of 0.
trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate(TokenDurSD = sd(TokenDur, na.rm=TRUE) %>% replace_na(0),
TokenDur = if_else(TokenDurSD > 0, scale(TokenDur), 0)) %>%
select(-TokenDurSD) %>%
ungroup()
Fourth, to deal with tokens that had missing pitch and/or F4 measurements (including F4 bandwidth), we imputed missing F0 and F4 measurements using k Nearest Neighbor imputation (with k = 10) via the function knnImputation() in the package DMwR (Torgo, 2010). Since random forests cannot classify observations with missing measurements, the imputation of missing measurements is common practice in machine learning contexts (e.g., Breen, Thomas, Baldwin, & Lipinska, 2019; Hudak, Crookston, Evans, Hall, & Falkowski, 2008). At the same time, this step won’t necessarily apply to every data set (we didn’t need to impute any measurements for the medial /t/ data). Note that this next step can be fairly time-consuming (on the cluster, it takes around 15 minutes).
For time-consuming steps such as imputation, we use caching to save time by avoiding re-running code. If we give the chunk a unique name and set cache=params$testing, R Markdown will save the results of the code chunk to a cache when initially knitting the file, such that on subsequent knits the results will simply be re-loaded. This can save a considerable amount of time, especially if you want to tweak text outside of the cached chunks, or plots that use data generated by cached chunks, etc. More info on caching can be found at https://yihui.name/knitr/demo/cache/ and in the knitr manual at https://yihui.name/knitr/demo/manual/.
##Capture current column ordering
origCols <- trainingData %>% colnames()
##Set up dataframe for imputation: Must contain only the measures to be imputed
## and the measures being used to calculate k nearest neighbors.
imputeDF <-
trainingData %>%
##In this data, MatchId serves as a unique identifier for tokens
select(MatchId, TokenDur:absSlopeF0) %>%
select(-contains("F0"), -contains("F4"), -contains("BW4")) %>%
select_if(~ sum(is.na(.)) < 300) %>%
filter(complete.cases(.)) %>%
left_join(trainingData %>%
select(MatchId, TokenDur:absSlopeF0) %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId")
##Perform imputation--this takes a while!
imputeDF <-
imputeDF %>%
select(-MatchId) %>%
as.data.frame() %>%
knnImputation() %>%
cbind(imputeDF %>% select(MatchId))
##Add imputed values back into data frame (MatchId is a unique identifier, so
## we can use it to match tokens between data frames)
trainingData <-
trainingData %>%
left_join(imputeDF %>%
select(MatchId, contains("F0"), contains("F4"), contains("BW4")),
by="MatchId") %>%
select(-ends_with(".x")) %>%
rename_at(vars(ends_with(".y")), str_remove, ".y") %>%
select(!!origCols)
##Clean up
rm(imputeDF)
Fifth, we marked numerous measures for potential outlier-hood: formant timepoint measures, formant minima/maxima, pitch timepoint measures, pitch minima/maxima, intensity measures, and duration. Potential outliers were measurements that were more than 1.5 \(\times\) the interquartile range (third quartile minus first quartile) outside the first or third quartile for that measure’s distribution by speaker and by vowel. We won’t actually exclude outliers now; instead, we’ll use outlier data in Step 5 below to figure out which outliers we can afford to keep, and which ones we’re better off discarding. As with imputation, outlier-marking isn’t necessary or appropriate for all situations; we give more details in the subsection following this one.
trainingData <-
trainingData %>%
group_by(Speaker, Vowel) %>%
mutate_at(vars(matches("^F[01234](_[2-8][50]|min|max)$"), starts_with("intens_"), TokenDur),
list(Outlier = ~ . < quantile(., 0.25, na.rm=TRUE) - 1.5*IQR(., na.rm=TRUE) |
. > 1.5*IQR(., na.rm=TRUE) + quantile(., 0.75, na.rm=TRUE))) %>%
ungroup()
Finally, we excluded tokens that had missing measurements for measures other than F0 and F4.
trainingData <-
trainingData %>%
filter_at(vars(TokenDur:absSlopeF0), all_vars(!is.na(.))) %>%
droplevels()
Let’s save our progress by writing the training data to an RDS file.
saveLoadStartTime <- proc.time()
if (!dir.exists("Data")) dir.create("Data")
trainingData %>% saveRDS("Data/trainingData.Rds")
trainingData <- readRDS("Data/trainingData.Rds")
saveLoadEndTime <- proc.time()
Measurement error is a fact of life when using corpus methods, and of course classifiers will perform better if they have cleaner data. At the same time, not all data is appropriate for outlier-marking. Here are the conditions under which outlier-marking is appropriate:
Some measures include measurement error. If your data is spotless, then any outliers are ‘true’ outliers, not just mis-measurements, and your model will be better off for being able to see them.
For the measures with some measurement error, the distributions are either unimodal or their bi-/multimodality isn’t due to the classes of interest.
Let’s briefly investigate these conditions. First, do we have evidence of measurement error in our data? Looking at F3 minimum, the bulk of the distribution looks reasonable, but its extremes seem questionable (do we really have tokens whose F3 minimum is below 1000 Hz?). Similarly, the intensity at the time of the F3 minimum has some values that seem conspicuously low. When we hand-check these tokens, we find that these measurements are indeed erroneous, meaning there’s evidence for measurement error in our data:
trainingData %>%
filter(!is.na(F3min)) %>%
mutate(F3min_orig = F3min + Initial_F3_50) %>%
ggplot(aes(x=F3min_orig)) +
geom_histogram(binwidth=50) +
xlab("F3 minimum (Hz)")
trainingData %>%
filter(!is.na(intens_F3min)) %>%
ggplot(aes(x=intens_F3min)) +
geom_histogram(binwidth=1) +
xlab("Intensity at time of F3 minimum (dB)")
Once we’ve detected measurement error, we have to determine whether these measures are unimodally distributed. Rather than just eyeball the distributions, we’ll conduct Hartigan’s dip tests, which determine whether there is significant evidence for non-unimodality in a distribution (Freeman & Dale, 2013; Hartigan & Hartigan, 1985), using the diptest package (Maechler, 2016). The null hypothesis in these tests is that the distribution is unimodal, so a significant result means that we have evidence that the distribution is bimodal or multimodal. The dip test on the F3 minimum distribution indicates no evidence for non-unimodality, but (perhaps surprisingly) the dip test on the intensity at F3 minimum distribution indicates significant evidence for non-unimodality. In other words, F3 minimum has met the conditions for outlier-marking, but intensity hasn’t.
trainingData$F3min %>% dip.test()
Hartigans' dip test for unimodality / multimodality
data: .
D = 0.0010115, p-value = 0.9962
alternative hypothesis: non-unimodal, i.e., at least bimodal
trainingData$intens_F3min %>% dip.test()
Hartigans' dip test for unimodality / multimodality
data: .
D = 0.021, p-value < 2.2e-16
alternative hypothesis: non-unimodal, i.e., at least bimodal
The reason for the bimodality condition is that we don’t want to outlier-mark a bimodal distribution that is merely the combination of two unimodal distributions, each corresponding to one class. If the distribution of some measure is really two fairly separate unimodal distributions by class, then outlier-marking will eliminate measurements that are extreme with respect to the pooled distribution but are actually well-behaved for their own class. By contrast, if a bimodal distribution isn’t the result of the classes of interest, but instead the result of some other factor external to the classes, then it’s okay to outlier-mark the distribution (although it might be beneficial to normalize for the external factor if that’s possible). To that end, we can look at the by-class distribution of intensity; since we don’t see different modes for Absent vs. Present, intensity meets the conditions for outlier-marking.
trainingData %>%
filter(HowCoded=="Hand") %>%
ggplot(aes(x=intens_F3min, fill=Rpresent)) +
geom_density(alpha=0.5) +
xlab("Intensity at time of F3 minimum (dB)")
We don’t have any examples of distributions whose bimodality is due to the Absent/Present distinction in the /r/ data, but in the medial /t/ data, the center of gravity at 70% duration is clearly bimodal, and this is clearly due to Voiced tokens having a lower center of gravity than Voiceless tokens.
Plot: Medial /t/ center of gravity at 70% by class
Now that we’ve determined that we can outlier-mark these measures, what criteria do we use to mark individual measurements as outliers? Two potential rules of thumb are to mark measurements that are more than 2 standard deviations from the mean, or to mark measurements that are more than 1.5 times the interquartile range from the first or third quartiles. The standard deviation-based rule is only appropriate if the data is normal, while the IQR-based rule is more general. In this case, since we had a lot of measures to outlier-mark, some of which are clearly non-normal (e.g., intensity at F3 minimum), we used the more general IQR rule of thumb. We also generated distributions within each speaker and vowel, since a well-behaving measurement in a rare vowel context can easily look like an outlier in the overall data.
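The IQR rule can be sketched on a toy vector (without the by-speaker, by-vowel grouping the guide actually uses); the planted implausible value is flagged while the well-behaved measurements are not:

```r
## Toy illustration of the IQR-based outlier rule on a single vector
set.seed(1)
x <- c(rnorm(100, mean = 2500, sd = 200), 900)  # one implausibly low value

## Flag measurements more than 1.5 * IQR outside the first/third quartile
lower <- quantile(x, 0.25) - 1.5 * IQR(x)
upper <- quantile(x, 0.75) + 1.5 * IQR(x)
outlier <- x < lower | x > upper

## The planted value is flagged as an outlier
outlier[101]
## [1] TRUE
```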
Before tuning hyperparameters, it’s necessary to choose a classifier performance metric that the process of hyperparameter tuning will optimize. If you simply want the most accurate classifier possible, optimize for overall accuracy; if it’s overwhelmingly important that a particular class be accurately represented, optimize for that class’s accuracy. In our case, we wanted our classifiers to be highly accurate, but we also wanted our classifiers to balance true positives and true negatives; the latter aim was especially important in light of the class imbalance between Present and Absent in the /r/ training data. As a result, we tuned hyperparameters to optimize the area under the ROC curve (AUC) \(\times\) overall accuracy (see the Methods section of the main text). We also wanted to monitor the class accuracies of Present and Absent during the tuning process, even though we didn’t choose hyperparameters to optimize class accuracies.
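The composite metric can be illustrated in base R on hypothetical toy data; this simplified sketch uses a fixed 0.5 cutoff for accuracy, whereas our actual summary function (below) searches for the best cutoff:

```r
## Toy sketch of the composite metric: overall accuracy x AUC.
## 'probs' are hypothetical classifier probabilities for Present;
## 'obs' are the corresponding hand codes.
probs <- c(0.9, 0.8, 0.6, 0.4, 0.2, 0.1)
obs   <- c("Present", "Present", "Absent", "Present", "Absent", "Absent")

## AUC via the rank-sum (Mann-Whitney) formulation
pos <- probs[obs == "Present"]
neg <- probs[obs == "Absent"]
r   <- rank(c(pos, neg))
auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
  (length(pos) * length(neg))

## Overall accuracy at a fixed 0.5 cutoff
acc <- mean(ifelse(probs >= 0.5, "Present", "Absent") == obs)

acc * auc
## [1] 0.5925926
```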
The package that we used to run the classifiers, caret (Kuhn, 2018), allows the user to specify custom summary functions to report summary statistics, which is useful for hyperparameter tuning. We used the following function (a lightly modified version of caret’s multiClassSummary()) to report several summary statistics, including overall accuracy \(\times\) AUC.
cutoffSummary <- function(data, lev=NULL, model=NULL, obsCol="obs", returnDF=FALSE) {
require(ROCR)
require(tidyverse)
require(magrittr)
if (!is.null(obsCol) & obsCol!="obs") data <- data %>% rename("obs" = obsCol)
preds <- data %$% prediction(Present, obs)
auc <-
preds %>%
performance(measure="auc") %>%
attr("y.values") %>%
extract2(1)
cutoff <-
preds %>%
performance(measure="acc") %>%
attributes() %>%
extract(c("x.values", "y.values")) %>%
map_dfc(extract2, 1) %>%
filter(y.values==max(y.values)) %>%
pull(x.values) %>%
median()
data <- data %>%
mutate(PredCutoff = if_else(Present >= cutoff, "Present", "Absent"),
PredHalf = if_else(Present >= 0.5, "Present", "Absent"),
AccCutoff = PredCutoff==obs,
AccHalf = PredHalf==obs)
class_means <-
data %>%
group_by(obs) %>%
summarise_at(vars(starts_with("Acc")), mean) %>%
gather(Type, Acc, starts_with("Acc")) %>%
mutate(Type = str_c(Type, "_", obs)) %>%
select(Type, Acc) %>%
spread(Type, Acc) %>%
rename_all(str_replace, "Acc", "ClassAccuracy") %>%
rename_all(str_remove, "Cutoff")
# confMat <- data %$% table(Predicted = PredCutoff, Actual = obs)
ret <- data.frame(AccAUC = mean(data$AccCutoff)*auc,
AUC = auc,
Accuracy = mean(data$AccCutoff),
BestCutoff = cutoff,
AccuracyHalf = mean(data$AccHalf)) %>%
cbind(class_means)
if (!returnDF) ret <- ret %>% map_dbl(I)
return(ret)
}
First, we train an initial classifier using the default settings of train(). This initial classifier will help us understand what sort of performance we can expect from our final model; although we’ll tune hyperparameters and exclude outliers to improve the model’s performance, in reality the final model’s performance won’t be drastically better than the initial model.
In terms of syntax, we feed train() a formula, the data, a model type, a variable importance mode, and a trainControl object that contains some more details about how the model is to be run:
The formula describes the dependent variable we’re trying to classify (Rpresent) and says that everything else in the data (.) will serve as independent variables.
For the data, we use just the hand-coded tokens from the training data (recall that we’ve already filtered out tokens with missing measurements, etc., in Step 1), and just the columns containing the dependent variable and the acoustic measures that we want to use in training our classifier (which is why the formula can have . on the right-hand side).
We specify the "ranger" method for running random forests because train() can use over 200 models for classification and regression.
The variable importance mode will determine how ranger() ranks variables in terms of importance (here, "impurity" is the Gini index).
Finally, the various arguments of trainControl() tell train() to save the resampled classification predictions that the model generates for the hand-coded data, use our custom summary function (so we can use a non-default cutoff and calculate our chosen performance measure, overall accuracy \(\times\) AUC), save info on which tokens were in which resamples, and calculate not only binary classifications but also classifier probabilities (which is necessary for calculating AUC).
Because we haven’t specified anything else (e.g., number of trees, resampling method), train() will use its defaults for the various hyperparameters we’ll tune later.
fstInitStartTime <- proc.time()
##Run the model using a single core
registerDoSEQ()
##RF are a stochastic process, so ensure reproducibility by setting a seed
set.seed(302302)
fstInit <-
train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
trControl=trainControl(savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE))
Loading required package: ROCR
fstInitEndTime <- proc.time()
The fstInit model we’ve just generated has lots of information inside of it. To pull out the most salient information from the model—the resampled performance results, an average confusion matrix, the variable importance scores, and timing information—we’ll use a summary function. This summary function will also come in handy because the forest models are large files (fstInit on its own is 80.6 Mb), and they’ll add up when we’re generating large lists of them in Steps 4 and 5, so we’ll just save the summaries to disk rather than the forests.
To the last point about only saving the summaries: if we’re caching the chunks that are used to run the forests, the forests will be saved to disk as well; cache with caution. This file caches the forest-running chunks on a computing cluster with lots of extra space, but if you don’t have lots of extra space to use up on your drive, we wouldn’t recommend caching forests (especially if you set random seeds to ensure that you’d get the same results every time); the forests run in this document take up about 4 Gb.
fstSummary <- function(x) {
require(tidyverse)
##Create a simple summary with resampled performance measures, confusion
## matrix, variable importance, and timing info
simpleSummary <- function(x) {
resample <-
x %>%
extract2("resample") %>%
select(AccAUC, Accuracy, AUC, everything())
cutoff <-
resample %>%
pull(BestCutoff) %>%
mean(na.rm=TRUE)
confMat <-
x %>%
extract2("pred") %>%
mutate(pred = if_else(Present >= cutoff, "Present", "Absent")) %$%
table(Predicted = pred, Actual = obs)
varImp <-
x %>%
extract2("finalModel") %>%
extract2("variable.importance")
times <- x %>%
extract2("times") %>%
extract(-3)
list(resample = resample, confMat = confMat,
varImp = varImp, times = times)
}
##If x is a single forest, return a simple summary; if x is a list of
## forests, make each summary element a list of results for each forest
## (except for resample, which combines the results for all forests and adds
## a Model column)
if ("train" %in% class(x)) {
smry <- simpleSummary(x)
} else if (x %>% map_lgl(~ "train" %in% class(.x)) %>% all()) {
if (x %>% names() %>% is.null()) names(x) <- seq_along(x)
smry <-
x %>%
map(simpleSummary) %>%
transpose()
smry$resample <-
smry$resample %>%
imap_dfr(~cbind(Model = .y, .x, stringsAsFactors=FALSE)) %>%
mutate_at(vars(Model), factor, levels=names(x))
}
smry
}
For example, we can calculate the mean of resampled performance measures from the resample element of our summary object. Here, we see that the mean overall accuracy \(\times\) AUC of this initial classifier was 0.6972862:
smryInit <- fstInit %>% fstSummary()
smryInit$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Or we can look at timing, in the times element. The elapsed time shows how long the classifier actually took to run; the much larger user time reflects the total CPU time summed across cores, i.e., roughly how long the run would have taken on a single core rather than with a parallel backend:
smryInit$times
$everything
user system elapsed
8970.570 13.794 894.277
$final
user system elapsed
43.540 0.023 4.218
Of course, this is just one way to summarize a forest; depending on your needs, you might want to pull out different information. Regardless, we’ll save this summary (and not the forest itself) to the Model Status directory, and we’ll remove the forest from our workspace to save on memory:
smryInit %>% saveRDS("Model Status/SmryInit.Rds")
rm(fstInit)
In this step, we adjust several hyperparameters to find the settings that will yield the best-performing model (with “best-performing” defined with respect to our chosen performance measure, of course). These hyperparameters control how the random forest classifier is run in ranger and how caret determines classifier performance. In particular, caret takes the training data and generates training and test subsets to calculate classifier performance measures, optionally performing additional resampling on these training and test subsets.
We tune the following hyperparameters:
Additional resampling to perform on the training and test subsets (trainControl(sampling))
The method for generating training and test subsets from the training set, for the purpose of measuring performance (the method argument of trainControl())
Parameters controlling the generation of training and test subsets, depending on the method: number of folds/resampling iterations (trainControl(number)), number of repeated k-fold cross-validation iterations (trainControl(repeats)), and/or leave-p-out cross-validation training percentage (trainControl(p))
Parameters of the ranger model (controlled via the tuneGrid argument of train()):
The number of variables to test at each node (mtry)
Node splitting rule (splitrule)
Minimum node size (min.node.size)
The number of trees to generate (train(num.trees))
Because we have unbalanced classes in our hand-coded data, this additional resampling is likely to have the greatest impact on classifier performance of any of these hyperparameters, which is why we tune it first.
After constructing training and test subsets, train() can perform additional resampling to adjust for class imbalances in the training and test subsets prior to assessing classifier performance. By default, train() performs no additional resampling; other options include downsampling the majority class to match the size of the minority class, and SMOTE sampling, a hybrid method that combines downsampling the majority class with creating synthetic members of the minority class (Chawla, Bowyer, Hall, & Kegelmeyer, 2002). Given that the data are imbalanced, a model that looks at the data as-is will likely perform well on the majority class, Absent, but poorly on the minority class, Present. For all the other hyperparameters, we start with default values (method="boot", number=25, num.trees=500, tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1)).
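To make the distinction concrete, here is a minimal base-R sketch of downsampling on hypothetical imbalanced data (not the /r/ tokens); SMOTE goes further by also synthesizing new minority-class rows:

```r
## Hypothetical imbalanced data: 90 Absent vs. 10 Present tokens
set.seed(1)
toy <- data.frame(Rpresent = factor(rep(c("Absent", "Present"), c(90, 10))))

## Downsampling: sample each class down to the minority-class size
minoritySize <- min(table(toy$Rpresent))
downsampled <- do.call(rbind, lapply(split(toy, toy$Rpresent),
                                     function(d) d[sample(nrow(d), minoritySize), , drop=FALSE]))
table(downsampled$Rpresent)
##  Absent Present
##      10      10
```

With caret, we don't do this by hand; passing sampling="down" (or "smote") to trainControl() applies the resampling inside each training subset, as in the code below.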
resampStartTime <- proc.time()
##Define a vector with possible values of our hyperparameter
tuneResamples <- c("none", "down", "smote")
##Register a parallel cluster with 3 cores (the length of our hyperparameter
## vector)
cl <- makeCluster(tuneResamples %>% length())
registerDoParallel(cl)
##Set a parallel seed for reproducibility
clusterSetRNGStream(cl, 302302)
##Train the list of forests; the code inside the curly braces will be run in
## parallel for each value of the vector tuneResamples, outputting a classifier
## for each setting
fstListResamples <-
foreach(tuneRsmp = tuneResamples,
.packages=c("caret","dplyr")) %dopar% {
##Create an extensionless 'dummy file' to signal that the model has
## started running
file.create(paste("Model Status/Resamples", tuneRsmp, "begun", sep="_"))
##Call train()
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="boot",
number=25,
sampling=if (tuneRsmp=="none") NULL else tuneRsmp,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
##Create an extensionless 'dummy file' to signal that the model has
## finished running
file.create(paste("Model Status/Resamples", tuneRsmp, "completed", sep="_"))
##Return the classifier
return(mod)
} %>%
##Name each classifier with its hyperparameter setting
set_names(tuneResamples)
##Stop the parallel cluster
stopCluster(cl)
resampEndTime <- proc.time()
Clear the model status dummy files:
# dir("Model Status", pattern="^Resamples", full.names=TRUE) %>% file.remove()
Create and save a summary object for this tuning step (note that we’ve written fstSummary() to handle a list of classifiers as well as a single classifier):
smryResamples <- fstListResamples %>% fstSummary()
smryResamples %>% saveRDS("Model Status/SmryResamples.Rds")
The resampled performance measures indicate that SMOTE sampling narrowly outperforms no resampling, which outperforms downsampling.
smryResamples$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
Because we tested each classifier using 25 bootstrap resamples, we can assess performance across a range of possible outcomes. The plot below shows the resampled overall accuracy \(\times\) AUC for each hyperparameter setting. The SMOTE and downsampling models are more variable in their performance than the non-resampled model; this is not surprising given that the process of resampling introduces an additional layer of randomness. Nevertheless, the SMOTE models generally outperform the downsampled and non-resampled models.
smryResamples$resample %>%
ggplot(aes(x=Model, y=AccAUC)) +
geom_violin() +
xlab("Additional resampling method") +
ylab("Overall accuracy × AUC")
Even though we're optimizing our classifiers based on overall accuracy \(\times\) AUC, our summary function allows us to keep an eye on other performance metrics we might care about, like class accuracies. The plot below demonstrates that the choice of additional resampling method has a considerable impact on class accuracy. As we suspected, the non-resampled model shows a wide gap between high accuracy for Absent and low accuracy for Present; compared to the non-resampled model, the SMOTE model has slightly worse accuracy for Absent, but much better accuracy for Present. This is a tradeoff we're willing to make, and it is reflected in overall accuracy \(\times\) AUC, where the SMOTE model edges out the non-resampled model.
smryResamples$resample %>%
gather("Class", "ClassAccuracy", starts_with("ClassAccuracy_")) %>%
mutate_at(vars(Class), str_remove, "ClassAccuracy_") %>%
ggplot(aes(x=Model, y=ClassAccuracy, fill=Class)) +
geom_violin() +
xlab("Additional resampling method") +
ylab("Class accuracy")
Again, caret automatically saves timing data in the random forest objects it creates, and our summary objects store this timing information. Here, SMOTE takes substantially longer to run than no resampling or downsampling, because it takes the extra step of creating synthetic minority-class tokens. The time-saving benefits of parallel processing are also evident from the difference between user (the sum of running time on all cores) and elapsed (the actual time it took to run the models); it took 7 minutes to run the SMOTE model in parallel, but without parallel processing it would have taken nearly 25 minutes!
smryResamples$times
$none
$none$everything
user system elapsed
354.512 2.235 67.174
$none$final
user system elapsed
15.896 0.027 1.752
$down
$down$everything
user system elapsed
203.189 2.401 55.100
$down$final
user system elapsed
8.273 0.068 1.834
$smote
$smote$everything
user system elapsed
1141.469 6.938 395.779
$smote$final
user system elapsed
50.734 0.195 13.869
It’s also a good idea when you’re running these models to keep an eye on how much memory your forest lists are taking up by sitting in your R workspace. Here we see that our forest list with three different models takes up quite a bit of space. We’re running these forests on a platform where memory’s not a concern, but if it were, we’d certainly want to remove each forest list after extracting whatever summary information we want from the forest list and saving that summary to file.
fstListResamples %>% object.size() %>% format("Mb")
[1] "180.5 Mb"
The possible subset generation methods are the simple bootstrap estimator, the 0.632 bootstrap estimator, out-of-bag, k-fold cross-validation, repeated k-fold cross-validation, and leave-p-out cross-validation (aka leave-group-out cross-validation). Here, we use caret’s default parameter settings for each of these methods: 25 resamples for bootstrap, out-of-bag, and leave-p-out cross-validation; 10 folds for k-fold cross-validation and repeated k-fold cross-validation; and a training percentage of 75% for leave-p-out cross-validation. We’ll override one caret default, as we’ll use 5 repeats for repeated k-fold cross-validation (caret’s default of 1 repeat makes this method indistinguishable from non-repeated k-fold cross-validation).
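For reference, here is a sketch of the trainControl() calls embodying those defaults (repeats=5 is the one setting we override; in the actual tuning code below, the method is passed programmatically):

```r
library(caret)
## Default-parameter controls for each subset generation method (sketch)
ctrlBoot    <- trainControl(method="boot",       number=25)            # simple bootstrap
ctrlBoot632 <- trainControl(method="boot632",    number=25)            # 0.632 bootstrap
ctrlCV      <- trainControl(method="cv",         number=10)            # k-fold CV
ctrlRepCV   <- trainControl(method="repeatedcv", number=10, repeats=5) # repeated k-fold CV
ctrlLGOCV   <- trainControl(method="LGOCV",      number=25, p=0.75)    # leave-group-out CV
```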
Now, we generate our list of forests:
methodsStartTime <- proc.time()
tuneMethods <- c("boot", "boot632", "cv", "repeatedcv", "LGOCV")
cl <- makeCluster(tuneMethods %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListMethods <-
foreach(tuneMth = tuneMethods,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/Methods", tuneMth, "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method=tuneMth,
repeats=if (tuneMth=="repeatedcv") 5 else NA,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
file.create(paste("Model Status/Methods", tuneMth, "completed", sep="_"))
return(mod)
} %>%
set_names(tuneMethods)
stopCluster(cl)
methodsEndTime <- proc.time()
Clear the model status dummy files:
# dir("Model Status", pattern="^Methods", full.names=TRUE) %>% file.remove()
Create and save a summary object:
smryMethods <- fstListMethods %>% fstSummary()
smryMethods %>% saveRDS("Model Status/SmryMethods.Rds")
The resampling results give k-fold cross-validation as the best-performing method, although there's not much difference between methods in terms of performance:
smryMethods$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
We can also plot the resampled results to get a better sense of the range of performance for each of these methods.
smryMethods$resample %>%
ggplot(aes(x=Model, y=AccAUC)) +
geom_violin()
Now that we’ve selected repeated k-fold cross-validation as our subset generation method, what’s the optimal number of folds and the optimal number of repeats? For folds, We’ll sample a space of options around the default of 10: 4, 6, 8, 10, 12, 14. For repeats, we’ll sample a space of options around 5: 3, 5, 7.
foldsRepsStartTime <- proc.time()
tuneListFoldsReps <- expand.grid(Folds = seq(4, 14, by=2),
Repeats = seq(3, 7, by=2))
tuneListFoldsReps <- tuneListFoldsReps %>% split(tuneListFoldsReps %>% nrow() %>% seq_len())
tuneListFoldsReps <- tuneListFoldsReps %>% set_names(tuneListFoldsReps %>% map(str_c, collapse=" "))
cl <- makeCluster(9)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListFoldsReps <-
foreach(tuneNmRp = tuneListFoldsReps,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="repeatedcv",
number=tuneNmRp$Folds,
repeats=tuneNmRp$Repeats,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=1))
file.create(paste("Model Status/FoldsReps", paste(tuneNmRp, collapse="&"), "completed", sep="_"))
return(mod)
} %>%
set_names(names(tuneListFoldsReps))
stopCluster(cl)
foldsRepsEndTime <- proc.time()
Clear the model status dummy files:
# dir("Model Status", pattern="^FoldsReps", full.names=TRUE) %>% file.remove()
Create another summary object:
smryFoldsReps <- fstListFoldsReps %>% fstSummary()
smryFoldsReps$resample <- smryFoldsReps$resample %>%
separate(Model, c("NumFolds", "Repeats"), sep=" ")
smryFoldsReps %>% saveRDS("Model Status/SmryFoldsReps.Rds")
The resampling results give the model with 12 folds and 3 repeats as the best model:
smryFoldsReps$resample %>%
group_by(NumFolds, Repeats) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
ranger parameters (mtry, splitrule, min.node.size)
We define an initial search space for these parameters. A typical value of mtry for classification is the square root of the number of predictors, rounded down to the nearest integer, which in the case of the /r/ data is 13; so we'll try a few mtry values around 13: 9, 11, 13, 15, 17. splitrule has two values, "gini" and "extratrees", so we'll try both. Finally, the original randomForest package (Liaw & Wiener, 2002) uses min.node.size defaults of 1 for classification and 5 for regression, so for good measure we'll try 1, 5, and 10. For the numeric parameters (mtry and min.node.size), the best classifiers in this initial round may favor values at the edges of the ranges we try; if so, we'll do another round of tuning with a better-informed range of values.
tuneListStartTime <- proc.time()
tuneList <- expand.grid(mtry=seq(9, 17, by=2),
splitrule=c("gini", "extratrees"),
min.node.size=c(1, 5, 10),
stringsAsFactors=FALSE)
tuneList <- tuneList %>% split(tuneList %>% nrow() %>% seq_len())
tuneList <- tuneList %>% set_names(tuneList %>% map(str_c, collapse=" "))
cl <- makeCluster(10)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListTuneGrid <-
foreach(tuneVals = tuneList,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/TuneGrid",
paste(tuneVals, collapse=" "),
"begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=500,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=tuneVals)
file.create(paste("Model Status/TuneGrid",
paste(tuneVals, collapse=" "),
"completed", sep="_"))
return(mod)
} %>%
set_names(names(tuneList))
tuneListEndTime <- proc.time()
Clear the model status dummy files:
# dir("Model Status", pattern="^TuneGrid", full.names=TRUE) %>% file.remove()
To assess which tuning parameters are the best, we collect some of the most salient characteristics of the model (resampling results, confusion matrix, variable importance, and running time) into a summary object. We’ll also save this summary in its own .Rds file, in case something goes awry and we lose our R session (which is a risk when you’re trying to hold large objects in memory).
smryTuneGrid <- fstListTuneGrid %>% fstSummary()
smryTuneGrid %>% saveRDS("Model Status/SmryTuneGrid.Rds")
The resampled performance measures indicate the best performance with mtry=15, splitrule="gini", and min.node.size=1:
smryTuneGrid$resample %>%
group_by(mtry, splitrule, min.node.size) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
Finally, we can adjust the number of trees in our forest. We’ve been using 500 trees, but it seems reasonable to try smaller or larger forests. We’ll try forests of size 250, 500, 750, and 1000; if it seems that our forests’ performance only increases with greater size, we can try larger forests still.
treesStartTime <- proc.time()
tuneNumTrees <- seq(250, 1000, by=250)
cl <- makeCluster(tuneNumTrees %>% length())
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListNumTrees <-
foreach(tuneNmTr = tuneNumTrees,
.packages=c("caret","dplyr")) %dopar% {
file.create(paste("Model Status/NumTrees", tuneNmTr, "begun", sep="_"))
mod <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=tuneNmTr,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
file.create(paste("Model Status/NumTrees", tuneNmTr, "completed", sep="_"))
return(mod)
} %>%
set_names(tuneNumTrees)
stopCluster(cl)
treesEndTime <- proc.time()
Clear the model status dummy files:
# dir("Model Status", pattern="^NumTrees", full.names=TRUE) %>% file.remove()
Create and save another summary object:
smryNumTrees <- fstListNumTrees %>% fstSummary()
smryNumTrees %>% saveRDS("Model Status/SmryNumTrees.Rds")
The resampled performance measures are best for the model with 750 trees, so we’ll keep that setting, though we acknowledge that there is no real difference in performance between 250, 500, 750, and 1000 trees (that is, if our model took a very long time to run, we could get away with using just 250 trees).
smryNumTrees$resample %>%
group_by(Model) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
arrange(desc(AccAUC))
After the tuning process, our classifier has the following hyperparameter settings:
Additional resampling: SMOTE
Training/test subset generation method: repeated k-fold cross-validation
Training/test subset generation parameters: 12 folds, 3 repeats
ranger parameters: mtry=13, splitrule="gini", min.node.size=10
Number of trees: 750
Let’s train another classifier with these parameter settings to get a snapshot of performance after tuning hyperparameters (since we’re only training one model, we’ll dispatch with the parallel backend):
tunedStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
fstTuned <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
tunedEndTime <- proc.time()
Create and save a summary object:
smryTuned <- fstTuned %>% fstSummary()
smryTuned %>% saveRDS("Model Status/SmryTuned.Rds")
Here are the performance measures for our tuned classifier:
smryTuned$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Finally, we remove the forest objects, all except fstTuned, to avoid overtaxing memory (depending on your system, it might be a good idea to do this after each hyperparameter tuning step).
rm(fstListMethods, fstListFoldsReps, fstListNumTrees, fstListResamples, fstListTuneGrid)
So far, we've been training our classifiers on the entire set of hand-coded data available to us, without excluding any outliers, but it's worth questioning whether this is the right approach. On the one hand, bigger is better from the perspective of both training and testing: our sample size simulations demonstrate the positive effect of training set size on classifier performance, and the more permissive we are about outliers, the fewer test set tokens we throw out as 'uncodeable.' On the other hand, the principle of 'garbage in, garbage out' warns us against trusting the predictions of a model trained on bad data.
We already know how our model performs when we’re maximally permissive about outliers. Let’s train another model where we’re maximally conservative about outliers, discarding every token with even a single outlying measurement:
noOutStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
fstNoOutlier <- train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
filter_at(vars(ends_with("_Outlier")), all_vars(!.)) %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
As the performance measures below show, this no-outlier classifier improves considerably over our previous model.
noOutEndTime <- proc.time()
fstNoOutlier$resample %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
select(AccAUC, Accuracy, AUC, everything())
rm(fstNoOutlier)
But what is the cost of this improved performance? If we only keep tokens with zero outlying measurements, we throw out about 40% of our tokens for both the hand-coded training set and the auto-coded test set. This is because we have a lot of formant timepoint measures, so the likelihood of at least one outlier per token is high even if the likelihood of meaningful outliers is low. Indeed, only 5% of all formant timepoint measurements are outliers (a direct consequence of our 2-SD cutoff); among tokens with at least one outlying measurement, half have fewer than 4 outliers.
trainingData %>%
mutate(PossOutlier = apply(select(., ends_with("_Outlier")), 1, any)) %$%
table(HowCoded, PossOutlier)
PossOutlier
HowCoded FALSE TRUE
Auto 17204 15915
Hand 2978 2642
trainingData %>%
select(ends_with("_Outlier")) %>%
summarise_all(mean) %>%
apply(1, mean)
[1] 0.04197241
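That rate is close to what a 2-SD cutoff predicts for normally distributed measurements, where about 4.6% of values fall beyond two standard deviations of the mean:

```r
## Proportion of a normal distribution lying beyond +/- 2 SD
2 * pnorm(-2)
## [1] 0.04550026
```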
trainingData %>%
mutate(NumOutlier = apply(select(., ends_with("_Outlier")), 1, sum)) %>%
filter(NumOutlier > 0) %>%
ggplot(aes(x=NumOutlier)) +
geom_histogram(binwidth=1) +
facet_grid(HowCoded ~ .)
In other words, where outliers are concerned, we have to strike a balance. Some outliers likely hurt our classifier's performance enough that the classifier is better off without them, while others hurt performance so little that their tokens are worth keeping. With so many acoustic measures, outliers are all but inevitable, so we want to identify which ones we can afford to keep and which are detrimental to performance.
Here is the procedure we'll follow. We'll train a set of 5 classifiers (to lessen statistical noise) and obtain a ranking of which measures' outliers most hurt classifier performance. We'll then train another set of 5 classifiers that drops outliers in the top-ranked measure, and we'll compare the performance of the classifiers that drop outliers to that of the ones that don't. If there is a significant difference in performance, we'll consider the loss in data worth the gain in performance; we'll obtain an updated ranking of measures from this new set of classifiers and proceed to train another set that also drops outliers in its top-ranked measure. If there is no significant difference in performance, we'll move down to the next measure in the ranking to see if dropping its outliers leads to a significant gain. Once we reach three consecutive nonsignificant results, we'll stop. Through this procedure, we'll determine which outliers are most detrimental to performance and which ones we can afford to keep.
To lessen the effect of statistical noise, we’ll train 5 models for each step. We’ll use parallel processing as before, but with one additional bit of machinery. Previously, we ensured reproducibility by using parallel::clusterSetRNGStream() to pass the same random seed to each parallel process. If we did the same thing here, we would end up with 5 repetitions of the exact same model running in parallel! Not exactly the best use of parallel processing. As a result, we’ll also pass different vectors of random seeds to the seeds argument of trainControl(), ensuring that train() generates different training and test subsets from one parallel model to the next.
registerDoSEQ()
set.seed(302302)
##For repeated k-fold CV with n repeats, trainControl(seeds) must be a list of
## (n*k)+1 lists, each containing M integers (M = number of models being
## evaluated within each call to train())
seedList <- replicate(5,
replicate(37, sample.int(10000, 1), simplify=FALSE),
simplify=FALSE)
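As a quick sanity check on those dimensions, we can recreate the seed list and confirm it has one entry per parallel model, each with (12 folds * 3 repeats) + 1 = 37 seed entries:

```r
set.seed(302302)
seedList <- replicate(5,
                      replicate(37, sample.int(10000, 1), simplify=FALSE),
                      simplify=FALSE)
length(seedList)           # 5: one seed list per parallel model
unique(lengths(seedList))  # 37: (12 * 3) resampling seeds plus 1 final-model seed
```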
First, train a list of 5 classifiers on the full data set (i.e., without dropping any outliers):
drop0StartTime <- proc.time()
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut0 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop0EndTime <- proc.time()
Extract and save a summary:
smryDropOut0 <- fstListDropOut0 %>% fstSummary()
smryDropOut0 %>% saveRDS("Model Status/SmryDropOut0.Rds")
For each forest in this list, we can assess the effect of each measure's outliers on performance by comparing each fold's proportion of outliers per measure against that fold's performance. We pull each token's fold assignment from the pred element of each forest and add it to each token's outlier data:
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier")))
From this data, we can find the proportion of outliers for each measure in each fold (within each repetition):
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>%
group_by(Resample) %>%
summarise_all(mean)
We then take performance data from the resample element of the forest and join it to the outlier data:
fstListDropOut0[[1]] %>%
extract2("resample") %>%
select(AccAUC, Resample)
fstListDropOut0[[1]] %>%
extract2("pred") %>%
arrange(rowIndex) %>%
select(Resample) %>%
cbind(trainingData %>% filter(HowCoded=="Hand") %>% select(ends_with("_Outlier"))) %>%
group_by(Resample) %>%
summarise_all(mean) %>%
left_join(fstListDropOut0[[1]] %>%
extract2("resample") %>%
select(AccAUC, Resample),
by="Resample")
After we put this data together for all 5 classifiers in the list, we pass the resulting dataframe to cor() to get correlations, and we pull out just the correlations we care about: the correlations between AccAUC and the proportions of outliers. Finally, we store the data in a dataframe and, to save memory, remove the forest list from our R environment. The following function puts all these calculations together:
##Find the outliers that most hurt classifier performance by calculating the
## correlation between folds' test performance and proportion of outliers
##This code works for repeated k-fold cross-validation, but it'll need to be
## tweaked for different methods of generating training/test subsets.
outlierCors <- function(fstList, data=trainingData, dropped=NULL) {
##Get fold info (performance and proportion of outliers) for each forest; if
## outliers have already been dropped, don't include them in the calculation
if (length(dropped) == 0) {
foldInfo <- function(fst, data, dropped) {
data %>%
filter(HowCoded=="Hand") %>%
cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
group_by(Resample) %>%
summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
}
} else {
foldInfo <- function(fst, data, dropped) {
data %>%
filter(HowCoded=="Hand") %>%
filter_at(vars(dropped), all_vars(!.)) %>%
cbind(fst$pred %>% arrange(rowIndex) %>% select(Resample)) %>%
group_by(Resample) %>%
summarise_at(vars(ends_with("_Outlier")), mean, na.rm=TRUE) %>%
select(-dropped) %>%
left_join(fst$resample %>% select(AccAUC, Resample), by="Resample")
}
}
##Get fold info for all forests and put them together into a dataframe
ret <-
fstList %>%
map_dfr(foldInfo, data=data, dropped=dropped) %>%
select(-Resample) %>%
select(AccAUC, everything()) %>%
##Get correlations between AccAUC and outlier proportions
cor() %>%
##Reduce correlation matrix to a vector
extract(1, -1) %>%
sort() %>%
tibble(Measure = names(.), TestCor = .)
ret
}
Running this function on our forest list with no outliers dropped, we find an initial ranking of outliers with F3_80 as the outlier that most hurts classifier performance. We’ll try removing F3_80 outliers first.
(negOutliers <- fstListDropOut0 %>% outlierCors(dropped=NULL))
rm(fstListDropOut0)
Start by initializing a character vector (empty, for now) where we’ll store measures that we decide to drop:
dropped <- character(0L)
Now train another list of 5 classifiers, this time dropping tokens with outliers in the measure with the biggest negative impact on performance, F3_80:
drop11StartTime <- proc.time()
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut1_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop11EndTime <- proc.time()
Create and save a summary object:
smryDropOut1_1 <- fstListDropOut1_1 %>% fstSummary()
smryDropOut1_1 %>% saveRDS("Model Status/SmryDropOut1_1.Rds")
Now we can compare the overall accuracy \(\times\) AUC for the 5 classifiers that drop no outliers vs. the overall accuracy \(\times\) AUC for the 5 classifiers that drop tokens with F3_80 outliers. The latter classifiers clearly outperform the former, indicating that the performance gain from dropping F3_80 outliers is worth the loss in data.
list(smryDropOut0, smryDropOut1_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
Now, we’ll proceed to drop outliers for one measure at a time and compare classifier performance after each step using Wilcoxon rank-sum tests (rather than its parametric counterpart, an independent samples t-test). The distributions of classifier performance fail to satisfy the assumption of normality that is necessary to use an independent samples t-test, so instead we compare performance using a Wilcoxon rank-sum test. We’ll drop outliers until we get three consecutive results showing that classifier performance fails to significantly improve at \(\alpha\) = .05. In this case, we find a highly significant increase in classifier performance, which justifies dropping tokens with outlying F3_80 measurements.
list(smryDropOut0, smryDropOut1_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 46, p-value = 0.00247
alternative hypothesis: true location shift is less than 0
So we add F3_80 to our established set of dropped outliers, meaning we drop about 3% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut1_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3_80 and intens_F3max:
drop21StartTime <- proc.time()
rm(fstListDropOut1_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut2_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop21EndTime <- proc.time()
Create and save a summary object:
smryDropOut2_1 <- fstListDropOut2_1 %>% fstSummary()
smryDropOut2_1 %>% saveRDS("Model Status/SmryDropOut2_1.Rds")
The plot of performance suggests little performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping intens_F3max outliers, compared to our established set of dropped outliers.
list(smryDropOut1_1, smryDropOut2_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 0, p-value = 6.447e-09
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F2_35, training a list of classifiers on data that drops outliers in F2_35 in addition to our established set of dropped outliers.
drop22StartTime <- proc.time()
rm(fstListDropOut2_1)
tryDrop <- negOutliers$Measure[2]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut2_2 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop22EndTime <- proc.time()
Create and save a summary object:
smryDropOut2_2 <- fstListDropOut2_2 %>% fstSummary()
smryDropOut2_2 %>% saveRDS("Model Status/SmryDropOut2_2.Rds")
The plot of performance suggests further performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F2_35 outliers, compared to our established set of dropped outliers.
list(smryDropOut1_1, smryDropOut2_2) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 10, p-value = 8.961e-07
alternative hypothesis: true location shift is less than 0
So we add F2_35 to our established set of dropped outliers, meaning we drop about 8% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut2_2 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in intens_F3min, in addition to our established set of dropped outliers:
drop31StartTime <- proc.time()
rm(fstListDropOut2_2)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut3_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop31EndTime <- proc.time()
Create and save a summary object:
smryDropOut3_1 <- fstListDropOut3_1 %>% fstSummary()
smryDropOut3_1 %>% saveRDS("Model Status/SmryDropOut3_1.Rds")
The plot of performance suggests further performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping intens_F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut2_2, smryDropOut3_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 92, p-value = 0.2062
alternative hypothesis: true location shift is less than 0
So we add intens_F3min to our established set of dropped outliers, meaning we drop about 9.5% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut3_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_70, in addition to our established set of dropped outliers:
drop41StartTime <- proc.time()
rm(fstListDropOut3_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut4_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop41EndTime <- proc.time()
Create and save a summary object:
smryDropOut4_1 <- fstListDropOut4_1 %>% fstSummary()
smryDropOut4_1 %>% saveRDS("Model Status/SmryDropOut4_1.Rds")
The plot of performance suggests further performance gain from dropping F1_70 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F1_70 outliers, compared to our established set of dropped outliers.
list(smryDropOut3_1, smryDropOut4_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 31, p-value = 0.000197
alternative hypothesis: true location shift is less than 0
So we add F1_70 to our established set of dropped outliers, meaning we drop about 13% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut4_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F1_50, in addition to our established set of dropped outliers:
drop51StartTime <- proc.time()
rm(fstListDropOut4_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut5_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop51EndTime <- proc.time()
Create and save a summary object:
smryDropOut5_1 <- fstListDropOut5_1 %>% fstSummary()
smryDropOut5_1 %>% saveRDS("Model Status/SmryDropOut5_1.Rds")
The plot of performance suggests some further performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F1_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut4_1, smryDropOut5_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 61, p-value = 0.01647
alternative hypothesis: true location shift is less than 0
So we add F1_50 to our established set of dropped outliers, meaning we drop about 15% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut5_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F3min, in addition to our established set of dropped outliers:
drop61StartTime <- proc.time()
rm(fstListDropOut5_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut6_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop61EndTime <- proc.time()
Create and save a summary object:
smryDropOut6_1 <- fstListDropOut6_1 %>% fstSummary()
smryDropOut6_1 %>% saveRDS("Model Status/SmryDropOut6_1.Rds")
The plot of performance suggests some further performance gain from dropping F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test indicates a significant performance gain from dropping F3min outliers, compared to our established set of dropped outliers.
list(smryDropOut5_1, smryDropOut6_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 6, p-value = 1.934e-07
alternative hypothesis: true location shift is less than 0
So we add F3min to our established set of dropped outliers, meaning we drop about 17% of tokens:
dropped <- c(dropped, tryDrop)
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
We re-rank measures based on the results of this list of classifiers:
(negOutliers <- fstListDropOut6_1 %>% outlierCors(dropped=dropped))
And we continue the procedure by training another list of 5 classifiers, this time on data that drops tokens with outliers in F4_40, in addition to our established set of dropped outliers:
drop71StartTime <- proc.time()
rm(fstListDropOut6_1)
tryDrop <- negOutliers$Measure[1]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_1 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop71EndTime <- proc.time()
Create and save a summary object:
smryDropOut7_1 <- fstListDropOut7_1 %>% fstSummary()
smryDropOut7_1 %>% saveRDS("Model Status/SmryDropOut7_1.Rds")
The plot of performance suggests little performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_1) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_40 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_1) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 62, p-value = 0.01836
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F4_50, training a list of classifiers on data that drops outliers in F4_50 in addition to our established set of dropped outliers.
drop72StartTime <- proc.time()
rm(fstListDropOut7_1)
tryDrop <- negOutliers$Measure[2]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_2 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop72EndTime <- proc.time()
Create and save a summary object:
smryDropOut7_2 <- fstListDropOut7_2 %>% fstSummary()
smryDropOut7_2 %>% saveRDS("Model Status/SmryDropOut7_2.Rds")
The plot of performance suggests little performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_2) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F4_50 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_2) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 16, p-value = 5.886e-06
alternative hypothesis: true location shift is less than 0
So we move on to the next measure in the ranking, F1_60, training a list of classifiers on data that drops outliers in F1_60 in addition to our established set of dropped outliers. Since that’s been two consecutive nonsignificant results, we’ll stop this process if the next result is also nonsignificant.
drop73StartTime <- proc.time()
rm(fstListDropOut7_2)
tryDrop <- negOutliers$Measure[3]
cl <- makeCluster(5)
registerDoParallel(cl)
clusterSetRNGStream(cl, 302302)
fstListDropOut7_3 <- foreach(seed = seedList,
.packages=c("caret","dplyr")) %dopar% {
train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(c(dropped, tryDrop)), all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
seeds=seed,
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
}
stopCluster(cl)
drop73EndTime <- proc.time()
Create and save a summary object:
smryDropOut7_3 <- fstListDropOut7_3 %>% fstSummary()
smryDropOut7_3 %>% saveRDS("Model Status/SmryDropOut7_3.Rds")
The plot of performance suggests little performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1, smryDropOut7_3) %>%
set_names(c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped, tryDrop) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
A Wilcoxon rank-sum test fails to indicate a significant performance gain from dropping F1_60 outliers, compared to our established set of dropped outliers.
list(smryDropOut6_1, smryDropOut7_3) %>%
set_names(c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor,
levels=c(dropped[length(dropped)], tryDrop) %>% str_remove("_Outlier")) %>%
wilcox.test(AccAUC ~ Dropped, ., alternative="less", exact=TRUE)
Wilcoxon rank sum exact test
data: AccAUC by Dropped
W = 75, p-value = 0.06307
alternative hypothesis: true location shift is less than 0
Since that’s been three consecutive nonsignificant results, we stop our outlier-dropping process there.
rm(fstListDropOut7_3)
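For reference, the steps unrolled above amount to a single greedy loop. The sketch below is pseudocode, not runnable as-is: trainClassifierList() and comparePerformance() are hypothetical stand-ins for the foreach/train and wilcox.test code shown in full above.

```r
# Pseudocode sketch of the outlier-dropping procedure (hypothetical helpers):
dropped <- character(0)
nonsigStreak <- 0
while (nonsigStreak < 3) {
  # Rank measures by how negatively their outliers correlate with performance
  negOutliers <- outlierCors(currentList, dropped = dropped)
  tryDrop <- negOutliers$Measure[nonsigStreak + 1]           # next candidate
  candidateList <- trainClassifierList(c(dropped, tryDrop))  # hypothetical helper
  pValue <- comparePerformance(currentList, candidateList)   # hypothetical helper
  if (pValue < 0.05) {
    dropped <- c(dropped, tryDrop)    # keep the significant improvement
    currentList <- candidateList
    nonsigStreak <- 0                 # reset the nonsignificance streak
  } else {
    nonsigStreak <- nonsigStreak + 1  # stop after three in a row
  }
}
```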
We’ve dropped outliers in six measures whose outliers significantly hurt classifier performance: F3_80, F2_35, intens_F3min, F1_70, F1_50, and F3min. We determined which outliers to drop not by variable importance—most of these measures did not rank particularly high in variable importance—but by which measures’ outliers were most negatively correlated with classifier performance.
dropped %>%
str_remove("_Outlier") %>%
set_names(., .) %>%
map_int(~ smryTuned$varImp %>% sort(decreasing=TRUE) %>% names() %>% equals(.x) %>% which())
intens_F3min F2_50 F4_55 F3_75 F2_20 F1_80
14 40 64 61 137 58
Dropping these outliers comes at the price of about 1/6 of all auto-coded and hand-coded tokens.
tibble(Dropped = "None",
Auto = trainingData %>% filter(HowCoded=="Auto") %>% nrow(),
Hand = trainingData %>% filter(HowCoded=="Hand") %>% nrow()) %>%
rbind(
cbind(Dropped = dropped,
map_dfr(dropped %>% seq_along(),
~ trainingData %>%
filter_at(vars(dropped[1:.x]), all_vars(!.)) %>%
count(HowCoded) %>%
spread(HowCoded, n)
)
)
) %>%
mutate_at(vars(Auto, Hand), list(PctTossed = ~ (.[1] - .) / .[1]))
The loss of these tokens pays off in an incremental increase in the performance of our classifier.
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
separate(Resample, c("Fold","Rep"), "\\.") %>%
group_by(Model, Rep) %>%
summarise_at(vars(AccAUC), mean, na.rm=TRUE) %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>%
ggplot(aes(x=AccAUC, group=Dropped, fill=Dropped)) +
geom_density(alpha=0.5)
list(smryDropOut0, smryDropOut1_1, smryDropOut2_2, smryDropOut3_1,
smryDropOut4_1, smryDropOut5_1, smryDropOut6_1) %>%
set_names(c("None", dropped) %>% str_remove("_Outlier")) %>%
imap_dfr(~ .x %>%
extract2("resample") %>%
mutate(Dropped = .y)) %>%
mutate_at(vars(Dropped), factor, levels=c("None", dropped) %>% str_remove("_Outlier")) %>%
select(-Model) %>%
group_by(Dropped) %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
Next, we’ll train a final classifier with the settings we’ve tuned.
finalStartTime <- proc.time()
registerDoSEQ()
set.seed(302302)
finalClassifier <- train(Rpresent ~ .,
data=trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Hand") %>%
select(Rpresent, TokenDur:absSlopeF0),
method="ranger",
importance="impurity",
num.trees=750,
trControl=trainControl(method="repeatedcv",
number=12,
repeats=3,
sampling="smote",
savePredictions=TRUE,
summaryFunction=cutoffSummary,
returnResamp="all",
classProbs=TRUE),
tuneGrid=data.frame(mtry=13, splitrule="gini", min.node.size=10))
finalEndTime <- proc.time()
Here are the properties of our final classifier: performance measures, confusion matrix, variable importance, and timing. We’ll save this final classifier to file.
finalClassifier %>% fstSummary()
$resample
AccAUC Accuracy AUC BestCutoff AccuracyHalf ClassAccuracy_Absent
1 0.7492338 0.8439898 0.8877285 0.4966434 0.8388747 0.8794326
2 0.7699420 0.8615385 0.8936827 0.5822598 0.8435897 0.9539007
3 0.7151271 0.8414322 0.8498926 0.5689011 0.8184143 0.9397163
4 0.7360015 0.8363171 0.8800508 0.5825537 0.8260870 0.9361702
5 0.7770717 0.8618926 0.9015876 0.5591434 0.8439898 0.9609929
6 0.7040322 0.8363171 0.8418245 0.6414161 0.8081841 0.9397163
7 0.7103110 0.8307692 0.8550039 0.5158831 0.8179487 0.9113475
8 0.7654916 0.8567775 0.8934544 0.5500653 0.8414322 0.9326241
9 0.7423077 0.8307692 0.8935185 0.4710042 0.8256410 0.8439716
10 0.7386690 0.8414322 0.8778710 0.6127693 0.8235294 0.9468085
11 0.7570093 0.8542199 0.8861995 0.5068587 0.8516624 0.9184397
12 0.7162556 0.8388747 0.8538291 0.5211365 0.8158568 0.9078014
13 0.7234014 0.8282051 0.8734568 0.6382508 0.8256410 0.9609929
14 0.7494377 0.8491049 0.8826209 0.5053540 0.8465473 0.9113475
15 0.7157915 0.8260870 0.8664845 0.6495841 0.8107417 0.9574468
16 0.7000082 0.8209719 0.8526579 0.5644640 0.8132992 0.9468085
17 0.7738050 0.8695652 0.8898757 0.5893984 0.8439898 0.9361702
18 0.7948897 0.8746803 0.9087774 0.5369598 0.8670077 0.9255319
19 0.7443902 0.8414322 0.8846704 0.5995884 0.8081841 0.9539007
20 0.7406902 0.8461538 0.8753612 0.5259228 0.8358974 0.9290780
21 0.7640892 0.8618926 0.8865248 0.5648476 0.8491049 0.9432624
22 0.7430284 0.8439898 0.8803761 0.6097016 0.8363171 0.9503546
23 0.7281709 0.8286445 0.8787494 0.6216042 0.8107417 0.9255319
24 0.7330484 0.8512821 0.8611111 0.6131339 0.8307692 0.9503546
25 0.7405847 0.8439898 0.8774806 0.5499471 0.8363171 0.9148936
26 0.7305783 0.8307692 0.8793998 0.5567836 0.8179487 0.9042553
27 0.7257802 0.8388747 0.8651832 0.5242381 0.8286445 0.9184397
28 0.8037395 0.8746803 0.9188952 0.4453212 0.8670077 0.9007092
29 0.7257177 0.8286445 0.8757889 0.5676569 0.8260870 0.9219858
30 0.7796243 0.8670077 0.8992127 0.5159556 0.8593350 0.9255319
31 0.7001048 0.8312020 0.8422799 0.6177553 0.8209719 0.9539007
32 0.7594345 0.8618926 0.8811243 0.6095299 0.8363171 0.9539007
33 0.7461148 0.8410256 0.8871487 0.6698905 0.8205128 0.9716312
34 0.6881501 0.8179487 0.8413121 0.6578153 0.7948718 0.9539007
35 0.7455440 0.8516624 0.8753985 0.5263952 0.8388747 0.9255319
36 0.7487396 0.8439898 0.8871430 0.5169365 0.8465473 0.8829787
ClassAccuracy_Present ClassAccuracyHalf_Absent ClassAccuracyHalf_Present
1 0.7522936 0.8794326 0.7339450
2 0.6203704 0.9078014 0.6759259
3 0.5871560 0.8758865 0.6697248
4 0.5779817 0.8900709 0.6605505
5 0.6055046 0.9148936 0.6605505
6 0.5688073 0.8546099 0.6880734
7 0.6203704 0.8936170 0.6203704
8 0.6605505 0.8900709 0.7155963
9 0.7962963 0.8581560 0.7407407
10 0.5688073 0.8758865 0.6880734
11 0.6880734 0.9148936 0.6880734
12 0.6605505 0.8723404 0.6697248
13 0.4814815 0.8900709 0.6574074
14 0.6880734 0.9042553 0.6972477
15 0.4862385 0.8758865 0.6422018
16 0.4954128 0.8971631 0.5963303
17 0.6972477 0.8794326 0.7522936
18 0.7431193 0.8971631 0.7889908
19 0.5504587 0.8617021 0.6697248
20 0.6296296 0.8971631 0.6759259
21 0.6513761 0.9042553 0.7064220
22 0.5688073 0.8900709 0.6972477
23 0.5779817 0.8687943 0.6605505
24 0.5925926 0.8829787 0.6944444
25 0.6605505 0.8829787 0.7155963
26 0.6388889 0.8617021 0.7037037
27 0.6330275 0.8936170 0.6605505
28 0.8073394 0.9184397 0.7339450
29 0.5871560 0.8900709 0.6605505
30 0.7155963 0.9113475 0.7247706
31 0.5137615 0.8936170 0.6330275
32 0.6238532 0.8865248 0.7064220
33 0.5000000 0.8794326 0.6666667
34 0.4629630 0.8723404 0.5925926
35 0.6605505 0.9042553 0.6697248
36 0.7431193 0.8758865 0.7706422
mtry splitrule min.node.size Resample
1 13 gini 10 Fold01.Rep1
2 13 gini 10 Fold02.Rep1
3 13 gini 10 Fold03.Rep1
4 13 gini 10 Fold04.Rep1
5 13 gini 10 Fold05.Rep1
6 13 gini 10 Fold06.Rep1
7 13 gini 10 Fold07.Rep1
8 13 gini 10 Fold08.Rep1
9 13 gini 10 Fold09.Rep1
10 13 gini 10 Fold10.Rep1
11 13 gini 10 Fold11.Rep1
12 13 gini 10 Fold12.Rep1
13 13 gini 10 Fold01.Rep2
14 13 gini 10 Fold02.Rep2
15 13 gini 10 Fold03.Rep2
16 13 gini 10 Fold04.Rep2
17 13 gini 10 Fold05.Rep2
18 13 gini 10 Fold06.Rep2
19 13 gini 10 Fold07.Rep2
20 13 gini 10 Fold08.Rep2
21 13 gini 10 Fold09.Rep2
22 13 gini 10 Fold10.Rep2
23 13 gini 10 Fold11.Rep2
24 13 gini 10 Fold12.Rep2
25 13 gini 10 Fold01.Rep3
26 13 gini 10 Fold02.Rep3
27 13 gini 10 Fold03.Rep3
28 13 gini 10 Fold04.Rep3
29 13 gini 10 Fold05.Rep3
30 13 gini 10 Fold06.Rep3
31 13 gini 10 Fold07.Rep3
32 13 gini 10 Fold08.Rep3
33 13 gini 10 Fold09.Rep3
34 13 gini 10 Fold10.Rep3
35 13 gini 10 Fold11.Rep3
36 13 gini 10 Fold12.Rep3
$confMat
Actual
Predicted Absent Present
Absent 9410 1565
Present 742 2350
$varImp
TokenDur F1_20 diffF3F1_20 BW1_20 F1_25 diffF3F1_25
30.764500 14.989296 27.774775 12.594526 14.898406 35.396210
BW1_25 F1_30 diffF3F1_30 BW1_30 F1_35 diffF3F1_35
11.363728 14.465964 56.675720 11.031246 15.037537 53.184867
BW1_35 F1_40 diffF3F1_40 BW1_40 F1_45 diffF3F1_45
11.644876 15.900869 88.474917 11.188914 20.208349 82.429002
BW1_45 F1_50 diffF3F1_50 BW1_50 F1_55 diffF3F1_55
11.109157 21.000783 98.752872 11.300654 22.814180 105.090890
BW1_55 F1_60 diffF3F1_60 BW1_60 F1_65 diffF3F1_65
10.484514 20.028613 101.340742 10.741350 21.784939 107.645116
BW1_65 F1_70 diffF3F1_70 BW1_70 F1_75 diffF3F1_75
11.684349 23.100147 85.321544 11.364907 17.120429 69.676064
BW1_75 F1_80 diffF3F1_80 BW1_80 F1min F1max
13.908952 16.264453 71.695080 13.982099 13.978461 16.244214
time_F1min time_F1max F1range time_F1range slopeF1 F2_20
10.716749 15.976890 15.492185 14.778713 15.017035 13.418690
diffF3F2_20 BW2_20 F2_25 diffF3F2_25 BW2_25 F2_30
18.594373 15.508765 14.383006 18.672362 16.197715 14.697420
diffF3F2_30 BW2_30 F2_35 diffF3F2_35 BW2_35 F2_40
22.556039 14.553842 18.917379 31.091438 14.986958 18.764349
diffF3F2_40 BW2_40 F2_45 diffF3F2_45 BW2_45 F2_50
57.603532 14.368941 21.673919 41.670748 14.341450 25.075409
diffF3F2_50 BW2_50 F2_55 diffF3F2_55 BW2_55 F2_60
33.157534 15.661945 26.726035 58.844238 15.921802 27.298895
diffF3F2_60 BW2_60 F2_65 diffF3F2_65 BW2_65 F2_70
70.402097 15.146769 27.666039 58.206411 15.455432 30.522271
diffF3F2_70 BW2_70 F2_75 diffF3F2_75 BW2_75 F2_80
67.469697 13.632785 29.495273 69.273594 14.613164 28.612871
diffF3F2_80 BW2_80 F2min F2max time_F2min time_F2max
59.781687 19.667721 29.827593 18.538668 14.969185 14.247801
F2range time_F2range slopeF2 F3_20 BW3_20 F3_25
15.528431 13.697598 13.794399 16.299933 12.529338 14.299296
BW3_25 F3_30 BW3_30 F3_35 BW3_35 F3_40
12.260252 16.969548 11.644058 15.191504 12.540938 19.668322
BW3_40 F3_45 BW3_45 F3_50 BW3_50 F3_55
11.902571 23.031201 11.924540 23.296989 12.930897 20.841978
BW3_55 F3_60 BW3_60 F3_65 BW3_65 F3_70
14.252782 23.552730 13.918116 22.068816 15.685204 27.661179
BW3_70 F3_75 BW3_75 F3_80 BW3_80 F3min
14.206712 24.406039 15.252917 22.914951 14.832923 46.059445
F3max time_F3min time_F3max F3range time_F3range slopeF3
16.779745 10.472060 10.696035 22.102011 12.300931 21.105624
intens_F3min intens_F3max F4_20 diffF4F3_20 BW4_20 F4_25
62.546789 57.160306 14.077990 17.383573 14.463928 13.705212
diffF4F3_25 BW4_25 F4_30 diffF4F3_30 BW4_30 F4_35
15.106839 12.998409 14.097109 14.712706 13.672993 16.610082
diffF4F3_35 BW4_35 F4_40 diffF4F3_40 BW4_40 F4_45
15.627823 13.311154 18.875092 16.406844 12.283367 18.648821
diffF4F3_45 BW4_45 F4_50 diffF4F3_50 BW4_50 F4_55
16.372688 12.600801 18.719656 17.039564 13.895799 16.920489
diffF4F3_55 BW4_55 F4_60 diffF4F3_60 BW4_60 F4_65
16.854955 12.618690 16.277981 15.287159 12.141336 15.556432
diffF4F3_65 BW4_65 F4_70 diffF4F3_70 BW4_70 F4_75
14.857038 11.304223 15.001727 12.751703 11.673830 14.855942
diffF4F3_75 BW4_75 F4_80 diffF4F3_80 BW4_80 F4min
13.773128 11.720099 16.228348 14.589646 14.368805 22.182335
F4max time_F4min time_F4max F4range time_F4range slopeF4
14.237622 14.364799 12.857766 17.628338 12.674008 18.742736
F0min F0max F0rangeST time_F0min time_F0max absSlopeF0
43.821598 40.135800 19.322108 9.391833 10.985579 12.539335
$times
$times$everything
user system elapsed
1725.510 23.025 401.573
$times$final
user system elapsed
52.869 0.862 12.695
finalClassifier %>% saveRDS("Data/ClassifierR.Rds")
Auto-coding is easy and fast. To get continuous probabilities, we pass the final classifier and non-hand-coded data (without the dependent variable and with relevant measures only) to predict() with type="prob".
predict() is what's called a "generic function" in R, with lots of different methods for objects of different classes that are passed to it. Since our final classifier is an object of class train (see class(finalClassifier)), calling predict(finalClassifier) passes the finalClassifier object to the predict.train() function that's loaded with the caret package. To see all the methods for a generic function like predict(), run methods("predict").
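As a quick illustration of how S3 dispatch works (using a made-up generic, describe(), not anything from caret), R picks the method based on the class of the first argument:

```r
## Minimal sketch of S3 method dispatch with a made-up generic.
## R matches the class of the first argument to a method name.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"
describe.train <- function(x, ...) "method for objects of class 'train'"

obj <- structure(list(), class = "train")
describe(obj)  # dispatches to describe.train()
describe(42)   # no numeric method, so falls back to describe.default()
```

This is exactly why predict(finalClassifier, ...) ends up in predict.train(): the finalClassifier object carries the class "train".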
preds <-
predict(finalClassifier,
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(TokenDur:absSlopeF0),
type="prob")
preds %>% head()
Then, as in the custom prediction function we defined above, you can use if_else() with a custom cutoff to convert continuous probabilities to binary.
bestCutoff <- mean(finalClassifier$resample$BestCutoff)
preds <- preds %>%
mutate(Rpresent = if_else(Present >= bestCutoff, "Present", "Absent") %>% factor())
preds %>% head()
In the case of our trainingData dataframe, where hand-coded and auto-coded tokens are interspersed, an extra step is required to add the predictions back into the dataframe. Here, we use the same trick as with the imputation dataframe in Step 1: we create a temporary dataframe for prediction that includes a unique identifier column (which we sneak out of the dataframe before prediction and sneak back in afterward), join the classifier probabilities onto the original trainingData dataframe, and recode the Rpresent dependent variable only for the auto-coded tokens with non-NA classifier probabilities.
preds <-
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(MatchId, TokenDur:absSlopeF0)
preds <-
predict(finalClassifier, preds %>% select(-MatchId), type="prob") %>%
cbind(preds %>% select(MatchId))
trainingData <-
trainingData %>%
left_join(preds %>% select(MatchId, ProbPresent = Present), by="MatchId") %>%
mutate(Rpresent = case_when(
HowCoded=="Hand" ~ Rpresent %>% as.character(),
HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent >= bestCutoff ~ "Present",
HowCoded=="Auto" & !is.na(ProbPresent) & ProbPresent < bestCutoff ~ "Absent",
TRUE ~ NA_character_
) %>%
factor())
rm(preds)
And now we’ve got auto-coded tokens! Here’s a sampling:
set.seed(302302)
trainingData %>%
filter_at(vars(F3_80_Outlier, F2_35_Outlier, intens_F3min_Outlier,
F1_70_Outlier, F1_50_Outlier, F3min_Outlier),
all_vars(!.)) %>%
filter(HowCoded=="Auto") %>%
select(HowCoded, Rpresent, ProbPresent, TokenDur:absSlopeF0) %>%
sample_n(10)
Finally, we can extract resampled predictions (binary codes and classifier probabilities) for the hand-coded training set, provided we set savePredictions=TRUE inside trainControl(). Note that we do not recommend comparing classifier probabilities for hand-coded tokens to those of the auto-coded tokens, because they are generated by slightly different processes: the auto-coded tokens are exposed to the entire classifier, the hand-coded tokens only to a subset. Nonetheless, obtaining the hand-coded probabilities can be useful for some applications (e.g., we used them to construct homogeneous vs. heterogeneous Present classes in the sample size simulations). The dataframe that results from the following code can be cbind()-ed to our existing hand-coded data. (Note also that this code works for the repeated k-fold cross-validation method we used; it would need to be tweaked for other methods of generating training/test subsets.)
finalClassifier$pred %>%
arrange(rowIndex) %>%
group_by(rowIndex) %>%
summarise(ProbPresent = mean(Present, na.rm=TRUE))
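To make the cbind() step concrete, here is a self-contained sketch with a toy stand-in for finalClassifier$pred and a hypothetical handCoded dataframe (both names are illustrative, not objects defined earlier in this guide); in practice you would use the summarised dataframe produced above:

```r
library(dplyr)

## Toy stand-in for finalClassifier$pred: two resamples covering two tokens
pred <- tibble(rowIndex = c(1, 2, 1, 2),
               Present  = c(0.8, 0.2, 0.6, 0.4))

## Average each token's Present probability across resamples,
## keeping rows in training-data order (rowIndex 1..n)
resampledProbs <- pred %>%
  group_by(rowIndex) %>%
  summarise(ProbPresent = mean(Present, na.rm = TRUE)) %>%
  arrange(rowIndex)

## Hypothetical hand-coded dataframe in the same row order as the training data
handCoded <- data.frame(Rpresent = c("Present", "Absent"))
handCoded <- cbind(handCoded, resampledProbs["ProbPresent"])
handCoded$ProbPresent  # 0.7 0.3
```

The cbind() is safe here only because the summarised dataframe is sorted by rowIndex, which matches the row order of the hand-coded data passed to train().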
This guide was generated in an R session with the following information:
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS: /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel grid stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] ROCR_1.0-11 diptest_0.76-0 doParallel_1.0.16 iterators_1.0.13
[5] foreach_1.5.1 DMwR_0.4.1 knitr_1.33 caret_6.0-88
[9] lattice_0.20-44 ranger_0.13.1 magrittr_2.0.1 forcats_0.5.1
[13] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.0.2
[17] tidyr_1.1.4 tibble_3.1.4 ggplot2_3.3.5 tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] nlme_3.1-152 fs_1.5.0 xts_0.12.1
[4] lubridate_1.7.10 httr_1.4.2 tools_4.1.0
[7] backports_1.2.1 bslib_0.3.0 utf8_1.2.2
[10] R6_2.5.1 rpart_4.1-15 DBI_1.1.1
[13] colorspace_2.0-2 nnet_7.3-16 withr_2.4.2
[16] tidyselect_1.1.1 curl_4.3.2 compiler_4.1.0
[19] cli_3.0.1 rvest_1.0.0 xml2_1.3.2
[22] labeling_0.4.2 bookdown_0.22 sass_0.4.0
[25] scales_1.1.1 proxy_0.4-26 digest_0.6.28
[28] rmarkdown_2.9 pkgconfig_2.0.3 htmltools_0.5.2
[31] highr_0.9 dbplyr_2.1.1 fastmap_1.1.0
[34] TTR_0.24.2 rlang_0.4.11 readxl_1.3.1
[37] quantmod_0.4.18 rstudioapi_0.13 farver_2.1.0
[40] jquerylib_0.1.4 generics_0.1.0 zoo_1.8-9
[43] jsonlite_1.7.2 ModelMetrics_1.2.2.2 Matrix_1.3-4
[46] Rcpp_1.0.7 munsell_0.5.0 fansi_0.5.0
[49] abind_1.4-5 lifecycle_1.0.1 stringi_1.7.4
[52] pROC_1.17.0.1 yaml_2.2.1 MASS_7.3-54
[55] plyr_1.8.6 recipes_0.1.16 crayon_1.4.1
[58] haven_2.4.1 splines_4.1.0 hms_1.1.1
[61] pillar_1.6.3 reshape2_1.4.4 codetools_0.2-18
[64] stats4_4.1.0 reprex_2.0.0 glue_1.4.2
[67] evaluate_0.14 data.table_1.14.2 modelr_0.1.8
[70] vctrs_0.3.8 tzdb_0.1.2 cellranger_1.1.0
[73] gtable_0.3.0 assertthat_0.2.1 xfun_0.24
[76] gower_0.2.2 prodlim_2019.11.13 broom_0.7.8
[79] e1071_1.7-7 class_7.3-19 survival_3.2-11
[82] timeDate_3043.102 lava_1.6.9 ellipsis_0.3.2
[85] ipred_0.9-11
##Pipe info from sacct into temp-file
sacctString <- paste0("sacct -j ", params$`SLURM_job-id`, " -o JobID,Elapsed,Partition,Reserved,ResvCPU,AllocCPUS,MaxRSS -P > tmp_", params$`SLURM_job-id`)
system(sacctString)
##Read info
info <- read_delim(paste0("tmp_", params$`SLURM_job-id`), delim="|", col_types="ctcttic")
resTime <-
info %>%
filter(!is.na(Partition),
JobID==params$`SLURM_job-id`) %>%
pull(Reserved) %>%
replace_na(0) %>%
as.double(units="secs")
##Substitute a missing-value message if there's no resTime
if (length(resTime)==0) resTime <- "Reservation time missing"
##Remove temp file (comment this line out when testing)
file.remove(paste0("tmp_", params$`SLURM_job-id`))
[1] TRUE
##Outfile name
outfile <- paste0(if_else(params$testing, "Timing_TEST", "Timing"), ".Rds")
stopTime <- proc.time()
timingSmry <- list(
jobID = params$`SLURM_job-id`,
nodelist = params$SLURM_nodeid,
testing = params$testing,
partition = params$SBATCH_partition,
nodes = params$SBATCH_nodes,
tasks = params$`SBATCH_tasks-per-node`,
cpus = params$`SBATCH_cpus-per-task`,
mem = params$SBATCH_mem,
reservTime = resTime,
startTime = startSysTime,
setupTime = setupEndTime - startTime,
saveLoadTime = saveLoadEndTime - saveLoadStartTime,
fstInitTime = fstInitEndTime - fstInitStartTime,
resampTime = resampEndTime - resampStartTime,
methodsTime = methodsEndTime - methodsStartTime,
foldsRepsTime = foldsRepsEndTime - foldsRepsStartTime,
tuneListTime = tuneListEndTime - tuneListStartTime,
tunedTime = tunedEndTime - tunedStartTime,
noOutTime = noOutEndTime - noOutStartTime,
drop0Time = drop0EndTime - drop0StartTime,
drop11Time = drop11EndTime - drop11StartTime,
drop21Time = drop21EndTime - drop21StartTime,
drop22Time = drop22EndTime - drop22StartTime,
drop31Time = drop31EndTime - drop31StartTime,
drop41Time = drop41EndTime - drop41StartTime,
drop51Time = drop51EndTime - drop51StartTime,
drop61Time = drop61EndTime - drop61StartTime,
drop71Time = drop71EndTime - drop71StartTime,
drop72Time = drop72EndTime - drop72StartTime,
drop73Time = drop73EndTime - drop73StartTime,
totalTime = stopTime - startTime
)
timingSmry %>% saveRDS(outfile)
timingSmry
$jobID
[1] 4481119
$nodelist
[1] "smp-n28"
$testing
[1] FALSE
$partition
[1] "smp"
$nodes
[1] 1
$tasks
[1] 4
$cpus
[1] 4
$mem
[1] 60000
$reservTime
[1] 107
$startTime
[1] "2021-10-06 11:00:13 EDT"
$setupTime
user system elapsed
2.363 0.302 20.958
$saveLoadTime
user system elapsed
5.378 0.070 5.559
$fstInitTime
user system elapsed
8970.641 13.795 894.351
$resampTime
user system elapsed
3.561 0.500 401.408
$methodsTime
user system elapsed
7.175 1.430 807.927
$foldsRepsTime
user system elapsed
22.678 4.988 3569.458
$tuneListTime
user system elapsed
37.177 9.464 3505.558
$tunedTime
user system elapsed
2319.748 27.297 520.749
$noOutTime
user system elapsed
798.062 9.943 168.803
$drop0Time
user system elapsed
4.278 1.642 1109.866
$drop11Time
user system elapsed
4.166 1.651 1115.601
$drop21Time
user system elapsed
4.172 1.573 1022.830
$drop22Time
user system elapsed
4.165 1.604 1032.348
$drop31Time
user system elapsed
4.622 2.081 1002.444
$drop41Time
user system elapsed
4.141 1.646 922.612
$drop51Time
user system elapsed
4.110 1.811 910.248
$drop61Time
user system elapsed
4.187 1.797 842.017
$drop71Time
user system elapsed
4.107 1.707 825.402
$drop72Time
user system elapsed
4.090 1.761 779.808
$drop73Time
user system elapsed
4.092 1.699 809.448
$totalTime
user system elapsed
14592.494 247.922 22315.589
Breen, M. S., Thomas, K. G., Baldwin, D. S., & Lipinska, G. (2019). Modelling PTSD diagnosis using sleep, memory, and adrenergic metabolites: An exploratory machine-learning study. Human Psychopharmacology: Clinical and Experimental, 34(2), e2691. https://doi.org/10.1002/hup.2691
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior Research Methods (Online), 45(1), 83–97. https://doi.org/10.3758/s13428-012-0225-x
Fromont, R., & Hay, J. (2012). LaBB-CAT: An annotation store. Proceedings of Australasian Language Technology Association Workshop, 113–117. Retrieved from http://www.aclweb.org/anthology/U12-1015
Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13(1), 70–84. https://doi.org/10.1214/aos/1176346577
Hay, J., & Clendon, A. (2012). (Non)rhoticity: Lessons from New Zealand English. In T. Nevalainen & E. C. Traugott (Eds.), The Oxford Handbook of the history of English (pp. 761–772). Oxford: Oxford University Press.
Heselwood, B. (2009). Rhoticity without F3: Lowpass filtering and the perception of rhoticity in ’NORTH/FORCE,’ ’START,’ and ’NURSE’ words. Leeds Working Papers in Linguistics and Phonetics, 14, 49–64. https://doi.org/10.1.1.500.6321
Hudak, A. T., Crookston, N. L., Evans, J. S., Hall, D. E., & Falkowski, M. J. (2008). Nearest neighbor imputation of species-level, plot-scale forest structure attributes from LiDAR data. Remote Sensing of Environment, 112(5), 2232–2245. https://doi.org/10.1016/j.rse.2007.10.009
Kuhn, M. (2018). Caret. Retrieved from https://CRAN.R-project.org/package=caret
Lawson, E., Stuart-Smith, J., & Scobbie, J. (2018). The role of gesture delay in coda /r/ weakening: An articulatory, auditory and acoustic study. Journal of the Acoustical Society of America, 143(3), 1646–1657. https://doi.org/10.1121/1.5027833
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22. Retrieved from https://CRAN.R-project.org/doc/Rnews/
Maechler, M. (2016). Diptest. Retrieved from https://cran.r-project.org/package=diptest
R Core Team. (2018). R: A language and environment for statistical computing. Retrieved from https://www.R-project.org/
Torgo, L. (2010). Data Mining with R, learning with case studies. Retrieved from http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
Wright, M. N., & Ziegler, A. (2017). Ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01